Challenges for Discontiguous Phrase Extraction
Abstract
Suggestions are made as to how phrase extraction algorithms should be adapted to handle gapped phrases. Such variable phrases are useful for many purposes, including the characterization of learner texts. The basic problem is that there is a combinatorial explosion of such phrases. Any reasonable program must start by putting the exponentially many phrases into equivalence classes (Yamamoto and Church, 2001). This paper discusses the proper characterization of gappy phrases and sketches a suffix-array algorithm for discovering
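To make the combinatorial explosion concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: the function names (`contiguous_phrases`, `gapped_phrases`, `suffix_array`, `count_ngram`) are hypothetical, the suffix array is built naively by sorting, and only contiguous n-grams are counted with it. The sketch assumes a whitespace-tokenized toy sentence and simply contrasts the quadratic number of contiguous phrases with the exponential number of gapped ones.

```python
from itertools import combinations


def contiguous_phrases(tokens):
    """All contiguous sub-sequences of the token list: n * (n + 1) / 2 of them."""
    n = len(tokens)
    return [tuple(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


def gapped_phrases(tokens):
    """All non-empty, order-preserving selections of positions: 2**n - 1 of them."""
    n = len(tokens)
    return [
        tuple(tokens[i] for i in positions)
        for k in range(1, n + 1)
        for positions in combinations(range(n), k)
    ]


def suffix_array(tokens):
    """Naive word-level suffix array (assumption: small input, so sorting slices is fine)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count occurrences of a contiguous n-gram with two binary searches over the suffix array."""
    ngram = list(ngram)
    m = len(ngram)

    # Lower bound: first suffix whose m-token prefix is >= ngram.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < ngram:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-token prefix is > ngram.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


tokens = "to be or not to be".split()
sa = suffix_array(tokens)
n_contiguous = len(contiguous_phrases(tokens))  # grows quadratically with sentence length
n_gapped = len(gapped_phrases(tokens))          # grows exponentially with sentence length
to_be_count = count_ngram(tokens, sa, ["to", "be"])
```

All suffixes that begin with the same n-gram sit next to each other in the sorted array, which is what makes grouping substrings into classes feasible for contiguous phrases. The gapped case has no such direct layout, which is the difficulty the abstract points to.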
Similar Resources
Unsupervised Syntax-Based Machine Translation: The Contribution of Discontiguous Phrases
We present a new unsupervised syntax-based MT system, termed U-DOT, which uses the unsupervised U-DOP model for learning paired trees, and which computes the most probable target sentence from the relative frequencies of paired subtrees. We test U-DOT on the German-English Europarl corpus, showing that it outperforms the state-of-the-art phrase-based Pharaoh system. We demonstrate that the incl...
String-to-Tree Multi Bottom-up Tree Transducers
We achieve significant improvements in several syntax-based machine translation experiments using a string-to-tree variant of multi bottom-up tree transducers. Our new parameterized rule extraction algorithm extracts string-to-tree rules that can be discontiguous and non-minimal in contrast to existing algorithms for the tree-to-tree setting. The obtained models significantly outperform the str...
A Study of Translation Rule Classification for Syntax-based Statistical Machine Translation
Recently, numerous statistical machine translation models that can utilize various kinds of translation rules have been proposed. In these models, not only conventional syntactic rules but also non-syntactic rules can be applied. Even pure phrase rules are included in some of these models. Although better performance is reported over the conventional phrase model and syntax model, ...
Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences
In this paper, we present a new technique for the extraction of discontiguous sequential descriptors from text. We are able to form word sequences without any restriction on their size or on the distance between their components. Based on the concept of a maximal frequent sequence (MFS), our approach allows for the extraction of compact text descriptors of quality in a more efficient manner tha...
A Phrase-Boundary Translation Model Using Shallow Syntactic Labels
The phrase-boundary model for statistical machine translation labels rules with classes of boundary words on the target-side phrases of the training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels, including POS tags and chunk labels. Giving priority to chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...